 Translational Bioinformatics



China races to build record biobank to rival U.S. drugs research

The Japan Times

Biobanks store masses of biomedical data such as clinical records, genome sequences and other long-term health metrics that research and drug development depend on. As a fledgling researcher in the U.S., Zhang Li was struck by the efficiency of extracting human tissue in the morning and mining it for data the same afternoon. Such a streamlined process had been missing from his years of training as a bio data scientist in China. Inspired, he returned home to Beijing to join the Chinese Institute for Brain Research and launch a national database that will collect blood and DNA samples from 33,000 children to help identify patterns of brain disease and their risk factors. "Biomedical data is extremely valuable and is fundamental for us to find solutions to diseases and to delay aging," said Zhang, surrounded by robotic arms carefully organizing blood samples.


PROSPECT: Labeled Tandem Mass Spectrometry Dataset for Machine Learning in Proteomics

Neural Information Processing Systems

Proteomics is the interdisciplinary field focusing on the large-scale study of proteins. Proteins essentially organize and execute all functions within organisms. Today, the bottom-up analysis approach is the most commonly used workflow, where proteins are digested into peptides and subsequently analyzed using Tandem Mass Spectrometry (MS/MS). MS-based proteomics has transformed various fields in life sciences, such as drug discovery and biomarker identification. Today, proteomics is entering a phase where it is helpful for clinical decision-making. Computational methods are vital in turning large amounts of acquired raw MS data into information and, ultimately, knowledge.
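
The digestion step of the bottom-up workflow described above can be illustrated with a minimal in silico sketch. The abstract does not name a protease; the example assumes trypsin with the commonly cited simplified rule (cleave after K or R unless the next residue is P). The function name `tryptic_digest` and the `missed_cleavages` parameter are hypothetical and are not taken from PROSPECT.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Split a protein sequence into peptides in silico.

    Assumption (not from the abstract): trypsin-like cleavage after
    K or R, except when the next residue is P.
    """
    if not sequence:
        return []

    # Collect cleavage positions, with the sequence start and end as boundaries.
    sites = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    # Emit fully cleaved peptides plus those spanning up to
    # `missed_cleavages` skipped sites.
    peptides = []
    for start in range(len(sites) - 1):
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end >= len(sites):
                break
            peptides.append(sequence[sites[start]:sites[end]])
    return peptides


if __name__ == "__main__":
    # Toy sequence, chosen only for illustration.
    for peptide in tryptic_digest("AKGRCK", missed_cleavages=1):
        print(peptide)
```

In a real pipeline the resulting peptides would then be matched against measured MS/MS spectra; that step is outside the scope of this sketch.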


OpenProteinSet: Training data for structural biology at scale

Neural Information Processing Systems

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.


Biconvex Biclustering

arXiv.org Machine Learning

This article proposes a biconvex modification to convex biclustering in order to improve its performance in high-dimensional settings. In contrast to heuristics that discard a subset of noisy features a priori, our method jointly learns and accordingly weighs informative features while discovering biclusters. Moreover, the method is adaptive to the data, and is accompanied by an efficient algorithm based on proximal alternating minimization, complete with detailed guidance on hyperparameter tuning and efficient solutions to optimization subproblems. These contributions are theoretically grounded; we establish finite-sample bounds on the objective function under sub-Gaussian errors, and generalize these guarantees to cases where input affinities need not be uniform. Extensive simulation results reveal our method consistently recovers underlying biclusters while weighing and selecting features appropriately, outperforming peer methods. An application to a gene microarray dataset of lymphoma samples recovers biclusters matching an underlying classification, while giving additional interpretation to the mRNA samples via the column groupings and fitted weights.


Multi-modal Transfer Learning between Biological Foundation Models

Neural Information Processing Systems

Modeling these sequences is key to understanding disease mechanisms and is an active research area in computational biology. Recently, Large Language Models have shown great promise in solving certain biological tasks, but current approaches are limited to a single sequence modality (DNA, RNA, or protein). Key problems in genomics intrinsically involve multiple modalities, but it remains unclear how to adapt general-purpose sequence models to those cases. In this work we propose a multi-modal model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality-specific encoders. We demonstrate its capabilities by applying it to the largely unsolved problem of predicting how multiple RNA transcript isoforms originate from the same gene (i.e.



Appendix ProteinShake: Building datasets and benchmarks for deep learning on protein structures

Neural Information Processing Systems

Table 3: Comparison of models trained with different representations of protein structure across various tasks, on a random data split. The optimal choice of representation depends on the task. Shown are mean and standard deviation across four runs with different seeds. Table 4: Comparison of models trained with different representations of protein structure across various tasks, on a sequence data split. Table 5: Comparison of models trained with different representations of protein structure across various tasks, on a structure data split.